LLM 25-Day Course - Day 4: Fully Understanding the Transformer Architecture

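Introduced in the 2017 paper “Attention Is All You Need”, the Transformer is the foundation of modern LLMs. It has largely replaced RNN/LSTM models for language tasks, and because it processes all tokens in a sequence in parallel instead of one step at a time, it is well suited to large-scale training.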

Overall Transformer Structure

Input text → [token embedding + positional encoding]
                    ↓
         ┌──────────────────────┐
         │  Encoder block (×N)  │
         │ ┌──────────────────┐ │
         │ │ Self-Attention   │ │
         │ │ + Residual + LN  │ │
         │ ├──────────────────┤ │
         │ │ Feed-Forward     │ │
         │ │ + Residual + LN  │ │
         │ └──────────────────┘ │
         └──────────────────────┘
                    ↓
         ┌──────────────────────┐
         │  Decoder block (×N)  │
         │ ┌──────────────────┐ │
         │ │ Masked Self-Attn │ │
         │ ├──────────────────┤ │
         │ │ Cross-Attention  │ │
         │ ├──────────────────┤ │
         │ │ Feed-Forward     │ │
         │ └──────────────────┘ │
         └──────────────────────┘
                    ↓
               Output text

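The GPT series uses only the decoder stack, and BERT uses only the encoder stack. A decoder-only model has no encoder output to attend to, so its blocks drop the Cross-Attention sublayer.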
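
The first step in the diagram, [token embedding + positional encoding], can be sketched as below. This is a minimal illustration that assumes the sinusoidal encoding from the original paper and an even d_model. The vocabulary size, the random embedding_table, and the token IDs are hypothetical values chosen only for the demo.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_length, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_length)[:, None]       # (seq_length, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Hypothetical toy setup: a 10-token vocabulary with random embeddings
rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([2, 7, 2])                 # the same token at positions 0 and 2
token_embeddings = embedding_table[token_ids]   # lookup, shape (3, 4)
x = token_embeddings + sinusoidal_positional_encoding(len(token_ids), d_model)

print(x.shape)                    # (3, 4)
print(np.allclose(x[0], x[2]))    # False: same token, different positions
```

Although positions 0 and 2 hold the same token, their input vectors differ once the positional encoding is added, which is what lets later layers tell the two occurrences apart.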

Simplified Self-Attention Implementation

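import numpy as np

def self_attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.shape[-1]

    # 1. Similarity scores: dot product of each query with every key, scaled by sqrt(d_k).
    #    swapaxes transposes only the last two axes, so batched inputs also work.
    scores = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(d_k)

    # 2. Normalize the scores into weights with softmax (row max subtracted for stability)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # 3. Apply the weights to the values
    output = np.matmul(attention_weights, value)
    return output, attention_weights

# 3 tokens, 4-dimensional vectors
seq_length, d_model = 3, 4
x = np.random.randn(seq_length, d_model)

# Simplification: x serves directly as Q, K, and V (no learned projections)
output, weights = self_attention(x, x, x)
print(f"Attention weights:\n{weights.round(3)}")
print(f"Output shape: {output.shape}")  # (3, 4)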
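
The "Masked Self-Attn" box in the decoder differs from plain self-attention only in a mask that hides future tokens. The sketch below is one common way to write it, not necessarily how any particular model implements it. The function name causal_self_attention is made up for this example and, as in the simplified version, x is used directly as Q, K, and V.

```python
import numpy as np

def causal_self_attention(x):
    """Self-attention in which token i may only attend to tokens 0..i."""
    seq_length, d_k = x.shape
    scores = x @ x.T / np.sqrt(d_k)

    # True above the diagonal marks "future" positions that must be hidden
    future = np.triu(np.ones((seq_length, seq_length), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Softmax: exp(-inf) = 0, so masked positions receive zero weight
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
output, weights = causal_self_attention(x)
print(weights.round(3))   # upper triangle is all zeros; each row sums to 1
```

The first token can only see itself, so its weight row is [1, 0, 0] and its output equals its own input vector.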

Feed-Forward Network and Layer Normalization

import numpy as np

# Uses self_attention() defined in the previous section.

def layer_norm(x, eps=1e-6):
    """Layer normalization: normalize each token vector (learnable scale/shift omitted)."""
    mean = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network."""
    # Expand the dimension, then project back down (typically a 4x expansion)
    hidden = np.maximum(0, np.matmul(x, w1) + b1)  # ReLU activation
    output = np.matmul(hidden, w2) + b2
    return output

def transformer_block(x, w1, b1, w2, b2):
    """A single Transformer block."""
    # Self-attention + residual connection + layer norm
    attn_output, _ = self_attention(x, x, x)
    x = layer_norm(x + attn_output)  # residual connection

    # Feed-forward + residual connection + layer norm
    ff_output = feed_forward(x, w1, b1, w2, b2)
    x = layer_norm(x + ff_output)    # residual connection
    return x

# Initialize and run
d_model, d_ff = 4, 16
w1 = np.random.randn(d_model, d_ff) * 0.1
b1 = np.zeros(d_ff)
w2 = np.random.randn(d_ff, d_model) * 0.1
b2 = np.zeros(d_model)

x = np.random.randn(3, d_model)
output = transformer_block(x, w1, b1, w2, b2)
print(f"Input shape: {x.shape}, output shape: {output.shape}")
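
To make the two helper steps more concrete, the following check shows what layer normalization does to each token vector and how the shape changes through the feed-forward network. It is a small sanity check under assumed values: the seed, the off-center input distribution, and the omission of biases are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
x = rng.normal(loc=5.0, scale=3.0, size=(3, d_model))   # deliberately off-center inputs

# Layer normalization, written the same way as layer_norm in this section
mean = x.mean(axis=-1, keepdims=True)
std = x.std(axis=-1, keepdims=True)
normed = (x - mean) / (std + 1e-6)
print(normed.mean(axis=-1).round(6))   # approximately 0 for every token
print(normed.std(axis=-1).round(6))    # approximately 1 for every token

# Feed-forward shapes: expand to d_ff, then project back to d_model
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
hidden = np.maximum(0, x @ w1)         # ReLU; biases left out for brevity
out = hidden @ w2
print(x.shape, hidden.shape, out.shape)   # (3, 4) (3, 16) (3, 4)
```

Because the block's output has the same shape as its input, blocks can be chained one after another without any reshaping.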

Summary of Key Components

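Component	Role	Key idea
Self-Attention	Captures relationships between tokens	Every token attends to every other token
Feed-Forward	Non-linear transformation	Expands the dimension, then projects it back
Layer Normalization	Stabilizes training	Normalizes the output of each layer
Residual Connection	Keeps gradients flowing	Adds the input to the output
Positional Encoding	Provides token order information	Without it, the model ignores word order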
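
The last row of the table can be checked numerically. The sketch below redefines a compact self-attention (Q = K = V = x, as in the simplified implementation) and feeds it the same token vectors in two different orders. The seed and the permutation are arbitrary choices for the demo.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no positional information)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))     # 3 token vectors, no positional encoding added
perm = np.array([2, 0, 1])      # reorder the "sentence"

out_original = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the tokens merely shuffles the outputs in the same way
print(np.allclose(out_shuffled, out_original[perm]))   # True
```

Each token receives the same output vector no matter where it sits in the sequence, so without positional information added to x the model cannot distinguish "dog bites man" from "man bites dog".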

The Transformer is the combination of these five components. Tomorrow we dig deeper into the most important one, the Attention mechanism.

Today's Exercises

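Explain why a deep network is hard to train without residual connections. Relate your answer to the vanishing gradient problem.
Stack the transformer_block function above six times to build a 6-layer Transformer, and check how the input and output shapes are preserved.
Summarize which tasks encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) models are each best suited to.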